University of Maryland - Info Challenge 2022
Team 32: Danny Rivas, Javan Reuto
Date: 03/05/22
After cleaning our data in Part One, we are now ready to start analyzing it.
First, we'll import our libraries.
#Importing libraries
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import plotly.io as pio
pio.renderers.default='notebook'
After importing the relevant libraries, we'll load two datasets: the removed loans on their own, and the full dataset containing both removed and retained loans.
#Importing datasets
ga_removed_df = pd.read_csv('ppp-ga-removed-clean.csv')
ga_df = pd.read_csv('ppp-ga-clean.csv', low_memory=False)
#Delete unnecessary col
del ga_df['Unnamed: 0']
Before comparing Removed Loans to Full Loans, we need to investigate the Removed Loans.
We created a dataframe that groups the removed loans by business type, with the average jobs retained and average loan amount for each type. We want to investigate whether some business types, because they have more workers, may need higher loans.
pd.set_option('display.max_columns', 50)
# group by business type and take the mean jobs retained and loan amount
businesstype_jobs = ga_removed_df.groupby(by=['business_type'])[['jobs_retained','amount']].mean().sort_values(by='amount', ascending=False)
businesstype_jobs = businesstype_jobs.reset_index()
businesstype_jobs
| | business_type | jobs_retained | amount |
|---|---|---|---|
| 0 | 501(C)3 – NON PROFIT | 5.833333 | 151043.131667 |
| 1 | CORPORATION | 6.673981 | 69870.952320 |
| 2 | PROFESSIONAL ASSOCIATION | 4.058824 | 65230.117647 |
| 3 | NON-PROFIT ORGANIZATION | 5.968750 | 58371.250000 |
| 4 | SUBCHAPTER S CORPORATION | 3.978378 | 41997.709297 |
| 5 | LIMITED LIABILITY COMPANY(LLC) | 2.872109 | 31833.373299 |
| 6 | LIMITED LIABILITY PARTNERSHIP | 4.257143 | 24723.402286 |
| 7 | PARTNERSHIP | 2.347826 | 24527.479130 |
| 8 | SINGLE MEMBER LLC | 1.633058 | 23126.942512 |
| 9 | QUALIFIED JOINT-VENTURE (SPOUSES) | 1.000000 | 20833.000000 |
| 10 | 501(C)6 – NON PROFIT MEMBERSHIP | 3.000000 | 20525.000000 |
| 11 | SOLE PROPRIETORSHIP | 1.076874 | 19169.341763 |
| 12 | JOINT VENTURE | 4.000000 | 16669.600000 |
| 13 | SELF-EMPLOYED INDIVIDUALS | 1.001456 | 16051.311316 |
| 14 | INDEPENDENT CONTRACTORS | 1.000252 | 14780.551149 |
| 15 | NON-PROFIT CHILDCARE CENTER | 1.000000 | 3400.000000 |
| 16 | EMPLOYEE STOCK OWNERSHIP PLAN(ESOP) | 1.000000 | 2103.000000 |
Observation:
We can see how business types differ in the average number of jobs retained and average loan amount. On average, corporations retained the most jobs. Interestingly, organizations categorized as tax-exempt nonprofits received the largest average loan amount. We can speculate that some of these nonprofits did not need the loans as much as others and were asked to pay them back.
While tax-exempt nonprofits received the highest average loan amount, this group didn't receive the most loans. We will now create a stacked histogram of total loan dollars by business type, color-coded by the LMI indicator.
The LMI indicator shows whether or not a business is in a low-to-moderate income community. According to the SBA, a "critical goal from Congress for the 2021 round of PPP was to reach small and low- and moderate-income (LMI) businesses who have not received the needed relief a forgivable PPP loan provides." In other words, PPP loans were intended to help relatively low-income communities.
#creating stacked histogram by business type for amount of loans and lmi indicator
gr_hist = px.histogram(ga_removed_df, x='business_type', y='amount',
hover_data=['amount'],
labels={'amount':'Amount',
'business_type':'Business Type'},
color = 'lmi_indicator',
height=600,
template='plotly_dark',
color_discrete_sequence=['LightSkyBlue', 'MediumPurple'],
title='Removed Applicants: Loans Distributed to Business Types by LMI'
)
gr_hist.update_layout(xaxis={'categoryorder': 'total descending'})
gr_hist.show()
Inference:
Over $300 million was disbursed to Sole Proprietorships. Of these loans, more than half went to businesses that were not in an LMI community. Likewise, the majority of loans to other business types went to businesses outside LMI communities. We can assume that being in an LMI community played a role in whether or not a loan was removed.
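A quick way to sanity-check this kind of claim is to aggregate loan dollars by the LMI flag. The sketch below uses a toy frame shaped like ga_removed_df (the column names lmi_indicator and amount come from the dataset above, but the rows are made up); substitute the real DataFrame to reproduce the check.

```python
import pandas as pd

# Toy stand-in for ga_removed_df: real column names, made-up rows
toy = pd.DataFrame({
    'business_type': ['SOLE PROPRIETORSHIP'] * 3 + ['CORPORATION'],
    'lmi_indicator': ['N', 'N', 'Y', 'N'],
    'amount': [100.0, 200.0, 50.0, 400.0],
})

# Total loan dollars per LMI flag, and each flag's share of all dollars
by_lmi = toy.groupby('lmi_indicator')['amount'].sum()
share = by_lmi / by_lmi.sum()
print(share)
```

On the real data, a share well above 0.5 for 'N' would confirm that most removed-loan dollars went outside LMI communities.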
The North American Industry Classification System (NAICS) is used to classify businesses by their activity. Each business in our datasets has a NAICS code; during data preparation we mapped NAICS names to these codes. Given the NAICS name, we can see which business activities appear most often in the removed-loans dataset.
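The mapping itself was done in Part One; below is a minimal sketch of the idea with a hypothetical three-entry lookup (the codes shown are real NAICS codes, but the actual table is much larger).

```python
import pandas as pd

# Hypothetical excerpt of the NAICS code-to-name lookup built in Part One
naics_lookup = {
    722511: 'Full-Service Restaurants',
    812112: 'Beauty Salons',
    484121: 'General Freight Trucking, Long-Distance',
}

codes = pd.Series([722511, 812112, 484121, 999999])
# Codes missing from the lookup map to NaN, which can later be filled with 'other'
names = codes.map(naics_lookup)
print(names.tolist())
```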
ga_removed_df= ga_removed_df.sort_values(by='amount')
gr_heat = px.density_heatmap(ga_removed_df, x='naics_name', y='business_type',
height=1200,
labels={'business_type': 'Business Type',
'naics_name':'Business Class'
},
template='plotly_dark',
color_continuous_scale='dense',
title='Removed Loans: Number of Loans for Business Classifications'
)
gr_heat.update_layout(xaxis={'categoryorder': 'total descending'})
gr_heat.show()
With this interactive heatmap, we can zoom in on different areas of the figure. If we zoom in to the bottom left corner we can find insightful information.
Inference:
We can see that the businesses appearing most often among removed loans were Sole Proprietorships and Independent Contractors. Looking across the Business Classes with the most removed loans, a common thread emerges:
From an accounting standpoint, it's highly likely that these businesses operate on a cash basis. In other words, revenue is recognized when cash is received rather than when a service or product is sold. Smaller businesses typically use this method of accounting, and they also face less public scrutiny from regulators such as the Securities and Exchange Commission. What does this mean? These businesses are more vulnerable to accounting errors, some of which may be fraudulent.
Using latitude and longitude data, we can further investigate which cities received higher loans on average and gauge the economic state of each one. We'll include poverty percentage to see how healthy each city's economy is. Ideally, we would like to see areas with higher poverty rates receiving higher loan amounts.
#Group removed data by city
ga_removed_cens = ga_removed_df.groupby('city').mean()
ga_removed_cens = ga_removed_cens.dropna()
#Add access token
mapbox_access_token = 'pk.eyJ1IjoiZGFubnlyaXZhcyIsImEiOiJjbDBjbjhrbXEwMHM4M25teGI3NG96OHcwIn0.WhCiY_NpUYr-GDrcPSbPOg'
px.set_mapbox_access_token(mapbox_access_token)
gr_map_poverty = px.scatter_mapbox(ga_removed_cens,
lat = 'lat',
lon = 'long',
color = 'poverty_percent',
size = 'amount',
color_continuous_scale='BuPu',
height = 900, zoom=6.5,
title='Removed Loans: Poverty Rates & Loan Amounts'
)
gr_map_poverty.update_layout(
mapbox_style = 'dark')
gr_map_poverty.show()
Inference:
We can see that cities with lower poverty rates received higher loans on average. While all of the loans in this visualization were removed, a trend emerges: cities with generally healthier economies, and more people employed, received higher loan amounts. Of course, if these areas are "thriving", it's possible the SBA saw fit to require repayment of those loans.
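One way to quantify that trend is the correlation between city-level poverty and average loan amount. This is a sketch on toy city aggregates; the real check would use the grouped ga_removed_cens frame from above.

```python
import pandas as pd

# Toy stand-in for the city-level means in ga_removed_cens
cities = pd.DataFrame({
    'poverty_percent': [0.05, 0.10, 0.18, 0.25],
    'amount': [90000.0, 60000.0, 35000.0, 20000.0],
})

# A negative Pearson correlation is consistent with lower-poverty cities
# receiving larger average loans
corr = cities['poverty_percent'].corr(cities['amount'])
print(corr < 0)
```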
Now we'll compare the removed loans to all the loans in the full dataset. First we'll group the data by Removed status, HubZone indicator, and LMI indicator, then create a sunburst chart to compare the two groups.
#Create a new df by grouping
ga_rem = ga_df.groupby(['Removed','hubzone_indicator','lmi_indicator']).mean().reset_index()
ga_sun = px.sunburst(ga_rem, path=['Removed','hubzone_indicator', 'lmi_indicator'],
values='amount',
template='plotly_dark',
color_discrete_sequence=['LightSkyBlue', 'MediumPurple'],
title='All Loans: Removed → HubZone → LMI',
height=600
)
ga_sun.show()
We can gain more insight by looking at city demographics. Below we'll make a scatter plot with one marker per city, looking at how removed and unremoved loans compare in diversity, poverty levels, and average loan amounts.
ga_df.hubzone_indicator = ga_df.hubzone_indicator.map(dict(Y=1, N=0))
ga_df.lmi_indicator = ga_df.lmi_indicator.map(dict(Y=1, N=0))
df_census = ga_df.groupby(['Removed','city']).mean().reset_index()
df_scatter = px.scatter(df_census, x='poverty_percent',y='minority_percent',
size='amount',
color='Removed',
hover_name='population',
labels={'minority_percent': 'Diversity',
'poverty_percent':'Poverty Level'
},
template='plotly_dark',
color_continuous_scale='Pinkyl',
trendline='ols',
title='Amounts of Loans Given in Diverse or Impoverished Cities',
height=800,
log_x=True, size_max=60)
df_scatter.show()
Observation:
Above, we can see a relatively strong correlation between diversity and poverty: as diversity (the share of minorities in a city) increases, so does the level of poverty. The size of each marker represents the average loan size in that city; the larger the marker, the higher the average loan. We can see that less diverse cities received more loans and larger loan sizes. While this chart does not tell us why some loans were removed, it does show that removed loans were smaller on average.
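The trendline in the figure comes from plotly's trendline='ols' option; the same slope can be recovered with an ordinary least-squares fit. This sketch uses made-up city aggregates standing in for df_census.

```python
import numpy as np

# Toy city aggregates: a positive slope mirrors the poverty-diversity
# relationship the OLS trendline shows in the scatter above
poverty = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
diversity = np.array([0.20, 0.35, 0.45, 0.60, 0.70])

# Degree-1 polynomial fit = simple linear regression
slope, intercept = np.polyfit(poverty, diversity, 1)
print(slope > 0)
```

On the real figure, px.get_trendline_results(df_scatter) returns the fitted statsmodels results, including R-squared and p-values for the line shown.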
The last metric we want to look at is loans approved over time. This may provide insight into what time of year was ideal to apply for a loan.
xy = pd.crosstab(index=ga_df.date_approved, columns=ga_df['Removed']).reset_index()
#skip the first column (the date itself) when selecting y series
vs_time = px.line(xy, x='date_approved', y=xy.columns[1:], template='plotly_dark',
labels={'value': 'Number of Loans',
'date_approved':'Time'
},
color_discrete_sequence=['LightSkyBlue', 'MediumPurple'],
title='Loans Approved Over Time',
height=800)
vs_time.show()
Inference:
At the beginning of the pandemic, we see a peak in the number of loans approved; all of these were loans that the SBA did not remove. Toward the beginning of March 2021, loans that were eventually removed from the dataset start to appear. It's possible that the removed loans were loans that were not forgiven; we now know that loans given at the beginning of the pandemic were eventually forgiven. Given that there was a second wave of COVID-19 cases, it's possible that more people began to apply and that the SBA put measures in place to better determine an applicant's qualifications.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
#Loading clean full ga loan application dataset
ga_full_df = pd.read_csv('ppp-ga-full-clean.csv', low_memory=False)
#Delete unnecessary col
del ga_full_df['Unnamed: 0']
#Creating function to split dataframe into chunks
def even_split(df, df2):
    #Nested function to shuffle dataframe
    def shuffle(df):
        df = df.sample(frac=1)
        return df
    #Shuffling dataframe
    df = shuffle(df)
    #Creating ratio to break dataframe into a chunk
    numerator = len(df2.index)
    denominator = numerator + len(df.index)
    fraction = numerator / denominator
    df_length = round(len(df) * fraction)
    #Breaking dataframe into a chunk using iloc
    df_short = df.iloc[0:df_length, :]
    return df_short
ga_chunk = even_split(ga_full_df, ga_removed_df)
#Concat chunk and removed dataframe
ga_1_balanced = pd.concat([ga_chunk, ga_removed_df]).reset_index()
#Keep only variables for the model
ga_1_balanced = ga_1_balanced[['amount',
'city',
'business_type',
'jobs_retained',
'date_approved',
'congressional_district',
'sba_office_code',
'processing_method',
'loan_status',
'term',
'initial_approval_amount',
'current_approval_amount',
'undisbursed_amount',
'servicing_lender_state',
'rural_urban_indicator',
'hubzone_indicator',
'business_age_description',
#'project_state',
'originating_lender_state',
'forgiveness_amount',
'naics_name',
'population',
'median_income',
'minority_percent',
'poverty_percent',
'Removed']]
ga_1_balanced
| | amount | city | business_type | jobs_retained | date_approved | congressional_district | sba_office_code | processing_method | loan_status | term | initial_approval_amount | current_approval_amount | undisbursed_amount | servicing_lender_state | rural_urban_indicator | hubzone_indicator | business_age_description | originating_lender_state | forgiveness_amount | naics_name | population | median_income | minority_percent | poverty_percent | Removed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12099.37 | FORSYTH | PARTNERSHIP | 1 | 2020-04-27 | GA-08 | 405 | PPP | PAID IN FULL | 24 | 12099.37 | 12099.37 | 0.0 | GA | R | N | EXISTING OR MORE THAN 2 YEARS OLD | GA | 12180.92 | Other Personal and Household Goods Repair and ... | 16322.0 | 53498.0 | 0.295675 | 0.144896 | 0 |
| 1 | 5555.00 | MONROE | SELF-EMPLOYED INDIVIDUALS | 1 | 2021-03-15 | GA-10 | 405 | PPS | EXEMPTION 4 | 60 | 5555.00 | 5555.00 | 0.0 | TX | R | Y | EXISTING OR MORE THAN 2 YEARS OLD | TX | 5584.78 | Janitorial Services | 26516.0 | 49722.0 | 0.286167 | 0.173254 | 0 |
| 2 | 11812.00 | SNELLVILLE | LIMITED LIABILITY COMPANY(LLC) | 1 | 2021-05-12 | GA-04 | 405 | PPS | EXEMPTION 4 | 60 | 11812.00 | 11812.00 | 0.0 | AZ | U | Y | EXISTING OR MORE THAN 2 YEARS OLD | AZ | 0.00 | Legal Counsel and Prosecution | 47778.0 | 75546.0 | 0.724811 | 0.098518 | 0 |
| 3 | 7983.00 | ATLANTA | SELF-EMPLOYED INDIVIDUALS | 1 | 2021-01-25 | GA-05 | 405 | PPS | EXEMPTION 4 | 60 | 7983.00 | 7983.00 | 0.0 | GA | U | N | EXISTING OR MORE THAN 2 YEARS OLD | GA | 8030.02 | Other Accounting Services | 66658.0 | 48347.0 | 0.984263 | 0.206922 | 0 |
| 4 | 395304.24 | CARTERSVILLE | CORPORATION | 25 | 2021-04-13 | GA-11 | 405 | PPS | EXEMPTION 4 | 60 | 395304.24 | 395304.24 | 0.0 | AR | R | N | EXISTING OR MORE THAN 2 YEARS OLD | AR | 0.00 | Hotels (except Casino Hotels) and Motels | 24742.0 | 56232.0 | 0.191941 | 0.113936 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50515 | 985024.00 | BROOKHAVEN | LIMITED LIABILITY COMPANY(LLC) | 34 | 2021-04-21 | GA-06 | 405 | PPP | ACTIVE UN-DISBURSED | 60 | 985024.00 | 985024.00 | 985024.0 | PA | U | Y | EXISTING OR MORE THAN 2 YEARS OLD | PA | 0.00 | Offices of Other Holding Companies | 43011.0 | 104435.0 | 0.225407 | 0.069471 | 1 |
| 50516 | 1007400.00 | MARIETTA | CORPORATION | 201 | 2021-03-31 | GA-06 | 405 | PPS | ACTIVE UN-DISBURSED | 60 | 1007400.00 | 1007400.00 | 1007400.0 | WV | U | Y | EXISTING OR MORE THAN 2 YEARS OLD | WV | 0.00 | Temporary Help Services | 33637.0 | 51878.0 | 0.651931 | 0.167494 | 1 |
| 50517 | 1600000.00 | CONYERS | 501(C)3 – NON PROFIT | 15 | 2021-05-18 | GA-13 | 405 | PPP | ACTIVE UN-DISBURSED | 60 | 1600000.00 | 1600000.00 | 1600000.0 | NJ | U | N | EXISTING OR MORE THAN 2 YEARS OLD | NJ | 0.00 | Religious Organizations | 28591.0 | 47411.0 | 0.505858 | 0.206149 | 1 |
| 50518 | 1624974.00 | DORAVILLE | LIMITED LIABILITY COMPANY(LLC) | 78 | 2021-05-23 | GA-06 | 405 | PPP | ACTIVE UN-DISBURSED | 60 | 1624974.00 | 1624974.00 | 1624974.0 | CA | U | N | EXISTING OR MORE THAN 2 YEARS OLD | CA | 0.00 | Other Direct Selling Establishments | 27796.0 | 48793.0 | 0.473090 | 0.186646 | 1 |
| 50519 | 10000000.00 | FAIRBURN | SOLE PROPRIETORSHIP | 495 | 2021-04-10 | GA-13 | 405 | PPP | ACTIVE UN-DISBURSED | 60 | 10000000.00 | 10000000.00 | 10000000.0 | NY | U | N | EXISTING OR MORE THAN 2 YEARS OLD | NY | 0.00 | Cosmetics, Beauty Supplies, and Perfume Stores | 33926.0 | 58016.0 | 0.829305 | 0.126245 | 1 |
50520 rows × 25 columns
o = ga_1_balanced['Removed']
o.value_counts()
o.value_counts().plot.pie(autopct='%.2f')
<AxesSubplot:ylabel='Removed'>
ga_1_balanced['naics_name'].fillna('other', inplace = True)
ga_1_balanced['minority_percent'].fillna(0, inplace = True)
ga_1_balanced['poverty_percent'].fillna(0, inplace = True)
ga_1_balanced['population'].fillna(ga_1_balanced['population'].mean(),inplace=True)
ga_1_balanced['median_income'].fillna(ga_1_balanced['median_income'].mean(), inplace=True)
ga_1_balanced.isna().sum()
amount                      0
city                        0
business_type               0
jobs_retained               0
date_approved               0
congressional_district      1
sba_office_code             0
processing_method           0
loan_status                 0
term                        0
initial_approval_amount     0
current_approval_amount     0
undisbursed_amount          0
servicing_lender_state      0
rural_urban_indicator       0
hubzone_indicator           0
business_age_description    0
originating_lender_state    0
forgiveness_amount          0
naics_name                  0
population                  0
median_income               0
minority_percent            0
poverty_percent             0
Removed                     0
dtype: int64
ga_1_balanced.processing_method = ga_1_balanced.processing_method.replace({'PPP': 1, 'PPS': 0})
ga_1_balanced.sba_office_code = ga_1_balanced.sba_office_code.replace({405: 1, 455: 0, 474: 2})
ga_1_balanced.loan_status = ga_1_balanced.loan_status.replace({'PAID IN FULL': 1, 'EXEMPTION 4': 0, 'ACTIVE UN-DISBURSED': 2})
ga_1_balanced.rural_urban_indicator = ga_1_balanced.rural_urban_indicator.replace({'U': 1, 'R': 0})
ga_1_balanced.hubzone_indicator = ga_1_balanced.hubzone_indicator.replace({'N': 1, 'Y': 0})
#ga_1_balanced.project_state = ga_1_balanced.project_state.replace({'GA': 1, 'FL': 0})
def category_encoder(df, column):
    #Build a lookup table of the column's unique values and their integer labels
    column_name = tuple(df[column].unique())
    column_name_df = pd.DataFrame(column_name, columns=['column_name_1'])
    labelencoder = LabelEncoder()
    column_name_df['column_name_cat'] = labelencoder.fit_transform(column_name_df['column_name_1'])
    column_name_df = column_name_df.set_index('column_name_1')
    encoded_dictionary = column_name_df.to_dict()['column_name_cat']
    #Map each original value to its integer label
    df[column] = df[column].map(encoded_dictionary)
    return df
congressional_district = category_encoder(ga_1_balanced, 'congressional_district')
city = category_encoder(ga_1_balanced, 'city')
business_type = category_encoder(ga_1_balanced, 'business_type')
date_approved = category_encoder(ga_1_balanced, 'date_approved')
servicing_lender_state = category_encoder(ga_1_balanced, 'servicing_lender_state')
business_age_description = category_encoder(ga_1_balanced, 'business_age_description')
originating_lender_state = category_encoder(ga_1_balanced, 'originating_lender_state')
naics_name = category_encoder(ga_1_balanced, 'naics_name')
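As an aside, the LabelEncoder round-trip above can be collapsed into a single pandas call. pd.factorize assigns one integer code per category directly (codes are assigned by order of first appearance, so they may differ from LabelEncoder's, but each category still gets a unique integer). A sketch on a toy column:

```python
import pandas as pd

# factorize codes categories in order of first appearance
s = pd.Series(['CORPORATION', 'PARTNERSHIP', 'CORPORATION', 'SOLE PROPRIETORSHIP'])
codes, uniques = pd.factorize(s)
print(codes.tolist(), list(uniques))  # [0, 1, 0, 2] and the three categories
```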
print(tuple(ga_1_balanced['term'].unique()))
ga_1_balanced.nunique()
(24, 60, 59, 6, 5, 55, 9, 8, 23)
amount                      15829
city                          530
business_type                  21
jobs_retained                 185
date_approved                 235
congressional_district         17
sba_office_code                 4
processing_method               2
loan_status                     3
term                            9
initial_approval_amount     15790
current_approval_amount     15829
undisbursed_amount           6025
servicing_lender_state         43
rural_urban_indicator           2
hubzone_indicator               2
business_age_description        5
originating_lender_state       43
forgiveness_amount          14420
naics_name                    855
population                    504
median_income                 509
minority_percent              510
poverty_percent               513
Removed                         2
dtype: int64
import plotly.express as px
fig = px.imshow(ga_1_balanced.corr(),aspect='equal', height=800,
template='plotly_dark',
color_continuous_scale='PuOr')
fig.show()
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
plt.figure(figsize=(12,8))
sns.heatmap(ga_1_balanced.corr(), cmap='PuBu', annot=True, fmt='.1f', linewidths=.1)
plt.show()
ga_1_balanced.isna().sum()
amount                      0
city                        0
business_type               0
jobs_retained               0
date_approved               0
congressional_district      0
sba_office_code             0
processing_method           0
loan_status                 0
term                        0
initial_approval_amount     0
current_approval_amount     0
undisbursed_amount          0
servicing_lender_state      0
rural_urban_indicator       0
hubzone_indicator           0
business_age_description    0
originating_lender_state    0
forgiveness_amount          0
naics_name                  0
population                  0
median_income               0
minority_percent            0
poverty_percent             0
Removed                     0
dtype: int64
X = ga_1_balanced.iloc[:, ga_1_balanced.columns != 'Removed']
y = np.ravel(ga_1_balanced.iloc[:, ga_1_balanced.columns == 'Removed'])
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[5986, 168],
[ 518, 5958]])
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="PuBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 257.44, 'Predicted label')
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
Accuracy: 0.9456848772763262
Precision: 0.9725759059745348
Recall: 0.9200123533045089
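These three numbers can be recomputed by hand from the confusion matrix above (rows are actual 0/1, columns are predicted 0/1), which is a useful sanity check on what the model is getting right and wrong:

```python
# Cells from the confusion matrix printed earlier
tn, fp = 5986, 168   # actual 0: correctly kept vs wrongly flagged as removed
fn, tp = 518, 5958   # actual 1: missed removals vs correctly caught removals

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 11944 / 12630
precision = tp / (tp + fp)                   # 5958 / 6126
recall = tp / (tp + fn)                      # 5958 / 6476
print(accuracy, precision, recall)
```

These reproduce the reported 0.9457 / 0.9726 / 0.9200: the model rarely flags a kept loan as removed (high precision), and it misses a somewhat larger share of the truly removed loans (lower recall).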
y_score = lr.predict_proba(X_test)[::,1]
fpr, tpr, thresh = metrics.roc_curve(y_test, y_score)
auc = metrics.roc_auc_score(y_test, y_score)
roc = px.area(x=fpr, y=tpr,
title=f'ROC Curve (AUC={auc:.4f})',
labels=dict(x='False Positive Rate',
y='True Positive Rate'),
height=700,
template='plotly_dark'
)
roc.add_shape(type='line', line=dict(dash='dash'),x0=0, x1=1, y0=0, y1=1)
roc.update_yaxes(scaleanchor="x", scaleratio=1)
roc.update_xaxes(constrain='domain')
roc.show()